Detecting Website Redesigns via Template Similarity on Streams of Documents

Author

  • Thomas Gottron
Abstract

Most websites undergo a redesign from time to time. A change in a site's appearance brings with it a different document structure. Hence, redesigns can be detected by observing changes in the structural similarity of monitored HTML documents. If we further assume that what is monitored is not a fixed document set but a series of the newest documents (e.g. provided by an RSS feed), redesign detection becomes a particular change detection operation on streams of documents. This paper describes and evaluates one simple approach and three more elaborate approaches to the problem. We show that redesigns can be detected automatically, effectively and efficiently.
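The core idea of the abstract, a drop in structural similarity between consecutive documents of a stream, can be illustrated with a minimal sketch. This is not one of the paper's four approaches, which the abstract does not specify. It assumes that the sequence of opening tags is an adequate stand-in for document structure, that `difflib.SequenceMatcher` is an acceptable similarity measure, and that a fixed threshold of 0.5 separates templates. The names `tag_sequence`, `structural_similarity` and `detect_redesigns` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity(html_a, html_b):
    """Similarity in [0, 1] between the tag sequences of two documents."""
    matcher = SequenceMatcher(
        None, tag_sequence(html_a), tag_sequence(html_b), autojunk=False
    )
    return matcher.ratio()


def detect_redesigns(stream, threshold=0.5):
    """Return the indices at which similarity to the previous document
    falls below `threshold` (an assumed, untuned value)."""
    changes = []
    previous = None
    for index, html in enumerate(stream):
        if previous is not None:
            if structural_similarity(previous, html) < threshold:
                changes.append(index)
        previous = html
    return changes


# Two toy templates: the old one is table-based, the new one uses div and ul.
OLD = "<html><body><table><tr><td>{}</td></tr></table></body></html>"
NEW = "<html><body><div><div><p>{}</p></div><ul><li>{}</li></ul></div></body></html>"

stream = [
    OLD.format("first story"),
    OLD.format("second story"),
    NEW.format("third story", "link"),
    NEW.format("fourth story", "link"),
]
redesign_points = detect_redesigns(stream)
```

Comparing each document only with its predecessor is the simplest possible choice. A real detector would likely compare against a window of earlier documents so that a single atypical page does not trigger a false alarm.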


Similar resources

Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to the authors’ rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in higher academic institutions. There exist language-free tools which do not yield reliable results, since they ignore the special features of each language. Considering the paucit...


Evaluation of Similarity Measures for Template Matching

Image matching is a critical process in various photogrammetry, computer vision and remote sensing applications such as image registration, 3D model reconstruction, change detection, image fusion, pattern recognition, autonomous navigation, and digital elevation model (DEM) generation and orientation. The primary goal of the image matching process is to establish the correspondence between two ...


Spatial Semantic Scan: Detecting Subtle, Spatially Localized Events in Text Streams

Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. We describe Spatially Compact Semantic Scan (SCSS) that has been developed specifically to overcome the shortcomings of current methods in detecting new spati...


An Automated Approach to Categorize the Web Documents through Text Mining

With increased access to the internet, it has become essential for all organizations, small and large, to have an effective web presence that acquaints users with the identity of the enterprise. Nowadays, the daily routine work of large organizations, such as communication, document distribution and tender declarations (notices, circulars, etc.), is done via websites. Web pages of a website are divided i...


HTHquery: a method for detecting DNA-binding proteins with a helix-turn-helix structural motif

SUMMARY HTHquery is a web-based service to determine if a protein structure has a helix-turn-helix structural motif which could bind to DNA. It is based on a similarity with a set of structural templates, the accessibility of a putative structural motif and a positive electrostatic potential in the neighbourhood of the putative motif. A set of scores are computed, based on each template, using ...




Publication date: 2009